Automatic Corpus-based Thai Word Extraction

Authors

  • TANAPONG POTIPITI
  • VIRACH SORNLERTLAMVANICH
  • THATSANEE CHAROENPORN
Abstract

The Thai language is notorious for its ambiguity. One important ambiguity is that there is no explicit word boundary; in other words, there is no explicit definition of what a word is. Traditional methods of defining words depend on human judgement, are based on unclear criteria or procedures, and have several limitations. This paper describes an automatic statistical method for Thai word extraction from plain Thai text that employs suffix-array, mutual-information and entropy techniques. The experimental results are encouraging: our algorithm extracts 428 acceptable words from a 1 MB plain Thai text corpus, and extraction accuracy is about 85 per cent on both the training and test corpora.
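
The abstract names three building blocks (a suffix array, mutual information and entropy) without giving their formulas. The sketch below is only an illustration of how such statistics are commonly computed for a candidate string in unsegmented text; it is not the authors' algorithm. The function names, the naive suffix-array construction, the pointwise form of mutual information and the toy Latin-alphabet corpus (standing in for Thai text) are all assumptions made for this example.

```python
import math
from collections import Counter


def build_suffix_array(text):
    # Naive construction (assumption: fine for a toy corpus, too slow for 1 MB):
    # sort every suffix start position by the suffix that begins there.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text, sa, s):
    # Binary-search the contiguous block of suffixes in `sa` that start with `s`.
    def bound(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(s)]
            if prefix < s or (upper and prefix == s):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(False), bound(True)


def count(text, sa, s):
    # Number of occurrences of `s` in `text`.
    start, end = find_range(text, sa, s)
    return end - start


def branching_entropy(text, sa, s, side="right"):
    # Shannon entropy (in bits) of the characters adjacent to `s`.
    # High entropy on both sides suggests `s` ends at a natural boundary.
    start, end = find_range(text, sa, s)
    neighbours = Counter()
    for k in range(start, end):
        pos = sa[k] + len(s) if side == "right" else sa[k] - 1
        if 0 <= pos < len(text):
            neighbours[text[pos]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


def pmi(text, sa, left, right):
    # Pointwise mutual information between two adjacent substrings,
    # using relative frequencies over the corpus length as probabilities.
    n = len(text)
    c_xy = count(text, sa, left + right)
    c_x = count(text, sa, left)
    c_y = count(text, sa, right)
    if not (c_xy and c_x and c_y):
        return float("-inf")
    return math.log2(c_xy * n / (c_x * c_y))


# Toy corpus with no spaces, standing in for unsegmented text.
corpus = "abcabcabd"
suffix_array = build_suffix_array(corpus)
print(count(corpus, suffix_array, "ab"))
print(branching_entropy(corpus, suffix_array, "ab", side="right"))
print(pmi(corpus, suffix_array, "a", "b"))
```

In a sketch like this, a candidate whose parts are strongly associated (high mutual information) and whose neighbouring characters vary widely (high entropy on both sides) would be a plausible word; how the paper actually combines these signals is not stated in the abstract.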

Similar resources

Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm (VIRACH SORNLERTLAMVANICH, TANAPONG POTIPITI and THATSANEE CHAROENPORN)

"Word" is difficult to define in languages that do not exhibit explicit word boundaries, such as Thai. Traditional methods of defining words for such languages depend on human judgement, which is based on unclear criteria or procedures, and have several limitations. This paper proposes an algorithm for word extraction from Thai texts without relying on word segmentation....

Automatic Thai Keyword Extraction from Categorized Text Corpus

Information Extraction (IE) is a process of discovering implicit and potentially important keywords in an unstructured natural-language text corpus. Most previously proposed solutions to IE work by constructing a set of words from the given text corpus during the preprocessing step. Due to the inherent characteristic of the Thai written language, which does not explicitly use any word d...

Multi-stage Annotation using Pattern-based and Statistical-based Techniques for Automatic Thai Annotated Corpus Construction

Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, special characteristics of the Thai language, such as the lack of word-boundary and sentence-boundary markers, raise several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework containing two stages of chunking and three stages of tagging. Two chu...

A Unified Model of Thai Romanization and Word Segmentation

Thai romanization is a way of writing the Thai language using the Roman alphabet. It can be performed on the basis of orthographic form (transliteration), pronunciation (transcription), or both. As a result, many systems of romanization are in use. The Royal Institute has established a standard by proposing a principle of romanization based on transcription. To ensure the standard, a ful...

Publication date: 2000